Your organization wants to know which companies are similar to each other to help identify potential customers of a SaaS software solution (e.g. Salesforce CRM or equivalent) in various segments of the market. The Sales Department is very interested in this analysis, which will help them more easily penetrate various market segments.
You will be using stock prices in this analysis. You come up with a method to classify companies based on how their stocks trade, using their daily stock returns (percentage movement from one day to the next). This analysis will help your organization determine which companies are related to each other (i.e. are competitors or have similar attributes).
You can analyze the stock prices using the unsupervised learning tools you’ve learned, including K-Means and UMAP. You will use a combination of kmeans() to find groups and umap() to visualize the similarity of daily stock returns.
2 Objectives
Apply your knowledge of K-Means and UMAP, along with dplyr, ggplot2, and purrr, to create a visualization that identifies subgroups in the S&P 500 Index. You will specifically apply:
library(broom)
library(umap)
4 Data
We will be using stock prices in this analysis. Although some of you already know how to use an API to retrieve stock prices, I have already obtained the stock prices for every stock in the S&P 500 Index for you. The files are saved in the session_6_data directory.
We can now read in the stock prices. The data set has 1.2M observations. The most important columns for our analysis are:
symbol: The stock ticker symbol that corresponds to a company’s stock price
date: The timestamp relating the symbol to the share price at that point in time
adjusted: The stock price, adjusted for any splits and dividends (we use this when analyzing stock data over long periods of time)
Answering this question helps us understand which companies are related, and we can use clustering to help us answer it!
Even if you’re not interested in finance, this is still a great analysis because it will tell you which companies are competitors and which are likely in the same space (often called sectors) and can be categorized together. Bottom line: this analysis can help you better understand the dynamics of the market and competition, which is useful for all types of analyses, from finance to sales to marketing.
Let’s get started.
5.1 Step 1 - Convert stock prices to a standardized format (daily returns)
What you first need to do is get the data in a format that can be converted to a “user-item” style matrix. The challenge here is to connect the dots between what we have and what we need to do to format it properly.
We know that in order to compare the data, it needs to be standardized or normalized. Why? Because we cannot compare values (stock prices) that are of completely different magnitudes. In order to standardize, we will convert from adjusted stock price (dollar value) to daily returns (percent change from previous day). Here is the formula.
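Reconstructed from the lag and difference steps this task describes, the daily return of a single stock would be as follows. The symbols $P_t$ (adjusted price on day $t$) and $r_t$ (return on day $t$) are notation introduced here for illustration only:

$$
r_t = \frac{P_t - P_{t-1}}{P_{t-1}}
$$

For example, a move in adjusted price from 100 to 110 gives $(110 - 100) / 100 = 0.10$, i.e. a 10% daily return.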
First, what do we have? We have daily stock prices for every stock in the S&P 500 Index, which covers over 500 stocks. The data set is over 1.2M observations.
Your first task is to convert to a tibble named sp_500_daily_returns_tbl by performing the following operations:
Select the symbol, date and adjusted columns
Filter to dates beginning in the year 2018 and beyond.
Compute a lag of 1 day on the adjusted stock price. Be sure to group by symbol first; otherwise, lags will be computed using values from the previous stock in the data frame.
Remove the NA values produced by the lagging operation
Compute the difference between adjusted and the lag
Compute the percentage difference by dividing the difference by that lag. Name this column pct_return.
Return only the symbol, date, and pct_return columns
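The steps above can be sketched on a tiny, made-up data set. This is only an illustration under stated assumptions: the input name sp_500_prices_tbl and the toy tickers and prices are hypothetical, and the intermediate columns lag and diff are named here for readability. dplyr::lag() is called with its namespace so that it cannot be confused with stats::lag().

```r
library(dplyr)

# Hypothetical toy prices standing in for the real S&P 500 data
sp_500_prices_tbl <- tibble(
  symbol   = rep(c("AAA", "BBB"), each = 4),
  date     = rep(as.Date(c("2017-12-29", "2018-01-02",
                           "2018-01-03", "2018-01-04")), times = 2),
  adjusted = c(100, 100, 110, 99, 50, 40, 50, 55)
)

sp_500_daily_returns_tbl <- sp_500_prices_tbl %>%
  select(symbol, date, adjusted) %>%
  # Keep 2018 and later
  filter(date >= as.Date("2018-01-01")) %>%
  # Lag within each symbol so prices never leak across stocks
  group_by(symbol) %>%
  mutate(lag = dplyr::lag(adjusted, n = 1)) %>%
  ungroup() %>%
  # The first row of each symbol has no previous price
  filter(!is.na(lag)) %>%
  mutate(
    diff       = adjusted - lag,
    pct_return = diff / lag
  ) %>%
  select(symbol, date, pct_return)

sp_500_daily_returns_tbl
```

Each toy symbol loses its 2017 row to the date filter and its first 2018 row to the lag, so two returns per symbol remain.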
The next step is to convert to a user-item format, with the symbol in the first column and one column per date holding that day’s return (pct_return) for each stock.
We’re going to import the correct results first (just in case you were not able to complete the last step).
Now that we have the daily returns (percentage change from one day to the next), we can convert to a user-item format. The user in this case is the symbol (company), and the item in this case is the pct_return at each date.
Spread the date column to get the values as percentage returns. Make sure to fill any NA values with zeros.
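As a minimal sketch of this reshaping step, the example below spreads a small, made-up returns table. The toy values are hypothetical; the output name stock_date_matrix_tbl is the one used later in this challenge. tidyr::spread() matches the wording of the task, and tidyr also offers pivot_wider() as its newer equivalent.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy returns; BBB has no observation on 2018-01-04
sp_500_daily_returns_tbl <- tibble(
  symbol     = c("AAA", "AAA", "BBB"),
  date       = as.Date(c("2018-01-03", "2018-01-04", "2018-01-03")),
  pct_return = c(0.10, -0.10, 0.25)
)

# One row per symbol, one column per date, missing combinations filled with 0
stock_date_matrix_tbl <- sp_500_daily_returns_tbl %>%
  spread(key = date, value = pct_return, fill = 0)

stock_date_matrix_tbl
```

Filling with zero treats a missing observation as "no price movement that day", which keeps every row the same length for kmeans() and umap().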
Next, we want to combine the layout from the umap_results with the symbol column from the stock_date_matrix_tbl.
Start with umap_results$layout
Convert from a matrix data type to a tibble with as_tibble()
Bind the columns of the umap tibble with the symbol column from the stock_date_matrix_tbl.
Save the results as umap_results_tbl.
# Convert umap results to tibble with symbols
umap_results_tbl <- umap_results$layout %>%
  as_tibble() %>%
  bind_cols(stock_date_matrix_tbl %>% select(symbol))
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
#> `.name_repair` is omitted as of tibble 2.0.0.
#> ℹ Using compatibility `.name_repair`.
# Output: umap_results_tbl
Finally, let’s make a quick visualization of the umap_results_tbl.
Pipe the umap_results_tbl into ggplot() mapping the columns to x-axis and y-axis
Add a geom_point() geometry with an alpha = 0.5
Apply theme_tq() and add a title “UMAP Projection”
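The plotting steps above might look like the following sketch. The umap_results_tbl built here is a random, hypothetical stand-in with the same column layout (V1, V2, symbol), so the picture itself is meaningless. The object name umap_plot is also introduced only for this example.

```r
library(dplyr)
library(ggplot2)
library(tidyquant)  # provides theme_tq()

# Hypothetical stand-in for umap_results_tbl: two layout columns plus symbol
set.seed(42)
umap_results_tbl <- tibble(
  V1     = rnorm(20),
  V2     = rnorm(20),
  symbol = paste0("S", 1:20)
)

umap_plot <- umap_results_tbl %>%
  ggplot(aes(x = V1, y = V2)) +
  geom_point(alpha = 0.5) +
  theme_tq() +
  labs(title = "UMAP Projection")

umap_plot
```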
First, pull out the K-Means for 10 Centers. Use this since beyond this value the Scree Plot flattens. Have a look at the business case to recall how that works.
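For context, here is a hedged sketch of how a table shaped like k_means_mapped_tbl could be built with purrr and then queried for one model. Everything in it is an assumption made for illustration: the toy data toy_tbl, the helper kmeans_mapper(), and the list-column names k_means and glance are not confirmed by this document. The toy example uses 1 to 6 centers and extracts the model for 4 centers.

```r
library(dplyr)
library(purrr)
library(tidyr)
library(broom)

# Hypothetical toy "user-item" table: 30 symbols, 3 numeric columns
set.seed(123)
toy_tbl <- tibble(
  symbol = paste0("S", 1:30),
  d1 = rnorm(30), d2 = rnorm(30), d3 = rnorm(30)
)

# Hypothetical helper: fit K-Means for a given number of centers
kmeans_mapper <- function(centers = 3) {
  toy_tbl %>%
    select(-symbol) %>%
    kmeans(centers = centers, nstart = 20)
}

# One row per candidate number of centers, models kept in a list-column
k_means_mapped_tbl <- tibble(centers = 1:6) %>%
  mutate(
    k_means = map(centers, kmeans_mapper),
    glance  = map(k_means, broom::glance)
  )

# Data behind a scree plot: total within-cluster sum of squares per row
scree_tbl <- k_means_mapped_tbl %>%
  unnest(glance) %>%
  select(centers, tot.withinss)

# Pull out a single fitted model (here: 4 centers)
k_means_obj <- k_means_mapped_tbl %>%
  filter(centers == 4) %>%
  pull(k_means) %>%
  pluck(1)

class(k_means_obj)
```

Note that filter() alone returns a one-row tibble; pull() and pluck() are what extract the fitted kmeans object that broom::augment() expects.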
# Pull the fitted K-Means model for 10 centers out of k_means_mapped_tbl
# Assumption: the fitted models are stored in a list-column named k_means
k_means_obj <- k_means_mapped_tbl %>%
  filter(centers == 10) %>%
  pull(k_means) %>%
  pluck(1)
Next, we’ll combine the clusters from the k_means_obj with the umap_results_tbl.
Begin with the k_means_obj
Augment the k_means_obj with the stock_date_matrix_tbl to get the clusters added to the end of the tibble
Select just the symbol and .cluster columns
Left join the result with the umap_results_tbl by the symbol column
Left join the result with the result of sp_500_index_tbl %>% select(symbol, company, sector) by the symbol column.
Store the output as umap_kmeans_results_tbl
# Use your dplyr & broom skills to combine the k_means_obj with the umap_results_tbl
# augment() appends the .cluster assignment for each row of stock_date_matrix_tbl
umap_kmeans_results_tbl <- k_means_obj %>%
  augment(stock_date_matrix_tbl) %>%
  select(symbol, .cluster) %>%
  left_join(umap_results_tbl, by = "symbol") %>%
  left_join(sp_500_index_tbl %>% select(symbol, company, sector), by = "symbol")
Plot the K-Means and UMAP results.
Begin with the umap_kmeans_results_tbl
Use ggplot() mapping V1, V2 and color = .cluster
Add the geom_point() geometry with alpha = 0.5
Apply colors as you desire (e.g. scale_color_manual(values = palette_light() %>% unname() %>% rep(3)))
# Visualize the combined K-Means and UMAP results
umap_kmeans_results_tbl %>%
  ggplot(aes(x = V1, y = V2, color = factor(.cluster))) +
  geom_point(alpha = 0.5) +
  # unname() drops the palette's color names so ggplot2 assigns colors by position
  scale_color_manual(values = palette_light() %>% unname() %>% rep(3))
Congratulations! You are done with the 1st challenge!